In this project we analyze data from IMDb's movie dataset in order to determine what factors affect the gross income of a movie.
IMDb is one of the world's largest online movie databases, containing reviews and user ratings for over 500,000 movies.
The dataset used in this project contains all movies on IMDb with more than 100 user ratings, as of 01/01/2020. The dataset was scraped from IMDb's website https://www.imdb.com and uploaded to the data science website Kaggle by a Kaggle user. The Kaggle page for this dataset can be accessed at https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset. Note: Kaggle is a great resource for finding tons of datasets for your future projects!
The Python libraries used in this project are the following:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import cpi
import collections
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
df = pd.read_csv("IMDb-movies.csv")
/home/alex/.miniconda/lib/python3.8/site-packages/IPython/core/interactiveshell.py:3165: DtypeWarning: Columns (3) have mixed types.Specify dtype option on import or set low_memory=False.
  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
Loading our dataset into a dataframe produced the following warning: "DtypeWarning: Columns (3) have mixed types.Specify dtype option on import or set low_memory=False." When reading a CSV, pandas guesses what kind of data is in each column, and this warning means that two different datatypes were found in the same column. Let's keep this in mind for when we tidy the dataframe.
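If we wanted to silence the warning at load time instead, we could tell pandas to read the whole file before inferring dtypes, or declare a dtype for the ambiguous column up front. Here's a minimal sketch of both options (column 3 turns out to be year, as we'll confirm below; for this walkthrough we stick with the plain read_csv call above):

#option 1: read the whole file before inferring each column's dtype
df = pd.read_csv("IMDb-movies.csv", low_memory=False)
#option 2: declare the ambiguous column's dtype explicitly
df = pd.read_csv("IMDb-movies.csv", dtype={'year': str})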
Since our table has a large number of columns, let's first change a display option so that all of the columns will be shown when we print our dataframe:
pd.set_option('display.max_columns', None)
Here are the first 5 rows of our dataframe (you may need to use the horizontal scrollbar to see all of the columns):
df.head()
| | imdb_title_id | title | original_title | year | date_published | genre | duration | country | language | director | writer | production_company | actors | description | avg_vote | votes | budget | usa_gross_income | worlwide_gross_income | metascore | reviews_from_users | reviews_from_critics |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tt0000009 | Miss Jerry | Miss Jerry | 1894 | 1894-10-09 | Romance | 45 | USA | None | Alexander Black | Alexander Black | Alexander Black Photoplays | Blanche Bayliss, William Courtenay, Chauncey D... | The adventures of a female reporter in the 1890s. | 5.9 | 154 | NaN | NaN | NaN | NaN | 1.0 | 2.0 |
| 1 | tt0000574 | The Story of the Kelly Gang | The Story of the Kelly Gang | 1906 | 1906-12-26 | Biography, Crime, Drama | 70 | Australia | None | Charles Tait | Charles Tait | J. and N. Tait | Elizabeth Tait, John Tait, Norman Campbell, Be... | True story of notorious Australian outlaw Ned ... | 6.1 | 589 | $ 2250 | NaN | NaN | NaN | 7.0 | 7.0 |
| 2 | tt0001892 | Den sorte drøm | Den sorte drøm | 1911 | 1911-08-19 | Drama | 53 | Germany, Denmark | NaN | Urban Gad | Urban Gad, Gebhard Schätzler-Perasini | Fotorama | Asta Nielsen, Valdemar Psilander, Gunnar Helse... | Two men of high rank are both wooing the beaut... | 5.8 | 188 | NaN | NaN | NaN | NaN | 5.0 | 2.0 |
| 3 | tt0002101 | Cleopatra | Cleopatra | 1912 | 1912-11-13 | Drama, History | 100 | USA | English | Charles L. Gaskill | Victorien Sardou | Helen Gardner Picture Players | Helen Gardner, Pearl Sindelar, Miss Fielding, ... | The fabled queen of Egypt's affair with Roman ... | 5.2 | 446 | $ 45000 | NaN | NaN | NaN | 25.0 | 3.0 |
| 4 | tt0002130 | L'Inferno | L'Inferno | 1911 | 1911-03-06 | Adventure, Drama, Fantasy | 68 | Italy | Italian | Francesco Bertolini, Adolfo Padovan | Dante Alighieri | Milano Film | Salvatore Papa, Arturo Pirovano, Giuseppe de L... | Loosely adapted from Dante's Divine Comedy and... | 7.0 | 2237 | NaN | NaN | NaN | NaN | 31.0 | 14.0 |
And here's our list of columns (with their data types) for convenience:
df.dtypes
imdb_title_id            object
title                    object
original_title           object
year                     object
date_published           object
genre                    object
duration                  int64
country                  object
language                 object
director                 object
writer                   object
production_company       object
actors                   object
description              object
avg_vote                float64
votes                     int64
budget                   object
usa_gross_income         object
worlwide_gross_income    object
metascore               float64
reviews_from_users      float64
reviews_from_critics    float64
dtype: object
We're gonna need to change some of the values in our dataframe before we can properly use it. First, let's figure out what the problem was with column 3. That's the year column, so all of its values are supposed to be ints. The loading function must have seen some values that weren't ints, such as strings. Let's see if we can find non-numeric string values in the year column:
years = df['year'].tolist()
print("Non-numeric year values:")
for i, year in enumerate(years):
    if not str(year).isdigit():
        print("Row " + str(i) + ": " + str(year))
Non-numeric year values: Row 83917: TV Movie 2019
Huh, it turns out it's just one row with some non-numeric characters in front of the year. Even so, it's still good practice to fix issues like this by tidying every row. Here's how to remove all of the non-numeric characters from the year column and then cast the column to ints:
#First, we cast our year column - which is currently a Series object - to a Series of strings.
#(A Series is a Pandas datatype that's similar to a list, but has extra functionality.)
#Then we use the apply() function to run a function on each value in the Series and store the result.
#The function we are applying takes all of the numeric characters of a given string and returns them as a string.
df['year'] = df['year'].astype(str).apply(lambda x: ''.join(i for i in x if i.isdigit()))
#setting the year column as a column of ints
df['year'] = df['year'].astype(int)
Now that that's done, let's change "worlwide_gross_income" (which has a typo) into a more manageable name:
df = df.rename(columns={'worlwide_gross_income': 'income'})
Next, let's get rid of the columns that aren't useful for analyzing what influences gross income. These are the columns related to movie titles, the people involved with the movie (this could be an important factor, but it would require a much more complex analysis), contemporary reviews (since reviews appear alongside or after a movie's release, we won't treat them as predictive factors), and any redundant data.
df = df[['year', 'genre', 'duration', 'country', 'budget', 'income']]
Another important bit of tidying is converting the budget and income values from dollar-amount strings into plain numbers. Really, we just need to remove the dollar signs from those values. We should also make sure those columns end up as float columns while we're at it.
#removing dollar signs:
df['budget'] = df['budget'].str.replace(r'\$', '', regex=True)
df['income'] = df['income'].str.replace(r'\$', '', regex=True)
#If a value is in a currency other than USD, the currency symbol will still show up.
#To keep things easy, let's just set all remaining non-numeric values to NaN using the following lines:
df['budget'] = pd.to_numeric(df['budget'], errors='coerce')
df['income'] = pd.to_numeric(df['income'], errors='coerce')
#to_numeric has already produced float columns, so no separate astype(float) step is needed
The final bit of tidying we need to do is to remove any rows that have a NaN value in the income column (our renamed worldwide gross income), since that column is the focus of our analysis:
df = df.dropna(subset=['income'])
df = df.reset_index().drop(['index'], axis=1)
And here's the result of our tidying:
df.head()
| | year | genre | duration | country | budget | income |
|---|---|---|---|---|---|---|
| 0 | 1916 | Drama, Fantasy, Horror | 63 | Russia | NaN | 144968.0 |
| 1 | 1920 | Fantasy, Horror, Mystery | 76 | Germany | 18000.0 | 8811.0 |
| 2 | 1921 | Drama | 107 | Norway | NaN | 4272.0 |
| 3 | 1920 | Comedy, Drama, Romance | 75 | USA | NaN | 772155.0 |
| 4 | 1921 | Drama, Romance, War | 150 | USA | 800000.0 | 9183673.0 |
Now that our data has been tidied, we can start looking at how different variables relate to one another, broadly speaking. Understanding the basic relationships between variables will help us decide which relationships to examine further in the later stages of our analysis. To understand these relationships, we need to visualize them by plotting them.
The first relationship we're gonna plot out is how the average income changes over time. Since we are graphing a single variable over time, we should use a line graph. We can easily make line graphs using the Matplotlib module:
#first we group the rows by their year, then for each grouping, we take the mean of that grouping's income
#this gives us a dataframe containing the mean income for every year
df_by_year = df.groupby(['year'])['income'].mean().reset_index()
#next need to make a Matplotlib figure to plot the data on, then plot our data on that figure
fig, ax = plt.subplots()
x = df_by_year['year']
y = df_by_year['income']
ax.plot(x, y)
#always title and label your plots so you know what you're looking at
ax.set_title("Average worldwide gross income for movies over time\n")
ax.set_xlabel("Year")
ax.set_ylabel("Average Gross Income\n(in tens of millions of USD)")
fig.show()
Looking at our graph we can see that, generally, average income increases over time. There are a few crazy spots on our graph though; those spikes in the 1940s probably have to do with World War II. We should always try to think about why our data says what it does. The general increase in income over time has to be affected by monetary inflation, but it may also have to do with the general growth of the movie industry over time. To figure out how much of this is caused by inflation, let's plot our average incomes adjusted for inflation:
#here we make a list of each income value adjusted for inflation
#we use a try/except because cpi.inflate() will throw an error if it doesn't have inflation values for a given year
lst = []
for index, row in df.iterrows():
    try:
        lst.append(cpi.inflate(row['income'], row['year']))
    except Exception:
        lst.append(np.NaN)
df['adj_income'] = lst
#now we generate and plot the means just like before, but with adjusted income
df_by_year = df.groupby(['year'])['adj_income'].mean().reset_index()
fig, ax = plt.subplots()
x = df_by_year['year']
y = df_by_year['adj_income']
ax.plot(x, y)
#always title and label your plots so you know what you're looking at
ax.set_title("Average worldwide gross income for movies over time, adjusted for inflation\n")
ax.set_xlabel("Year")
ax.set_ylabel("Average Gross Income\n(in hundreds of millions of USD)")
fig.show()
From this graph, we can see that the increasing trend over time was largely caused by inflation.
Big Hollywood movies tend to be the most popular, so one would assume there's a correlation between budget and income. Since this relationship may not flow linearly from left to right the way time does, we shouldn't plot it as a line graph. Instead, let's make a scatter plot, using Seaborn:
fig, ax = plt.subplots()
#Seaborn can make plots very easily
ax = sns.regplot(data=df, x='budget', y='income', truncate=False, line_kws={"color": "red"})
ax.set_title("Budget vs Worldwide Gross Income\n")
ax.set_xlabel("Budget (in hundreds of millions of USD)")
ax.set_ylabel("Gross Income (in billions of USD)")
fig.show()
Looking at all of our points, we can see a (fuzzy) positive relationship between budget and income. The trend line plotted on top of the scatter plot gives a more precise picture of this correlation. And since both budget and income are affected by inflation, we probably don't need an inflation-adjusted version of this plot. It isn't a very tight correlation, but it is a correlation.
So how might duration and income be related? Perhaps big Hollywood productions tend toward a certain duration, creating a correlation between duration and income. If there is a relationship, a scatter plot will show it:
fig, ax = plt.subplots()
ax = sns.regplot(data=df, x='duration', y='income', truncate=False, line_kws={"color": "red"})
ax.set_title("Duration vs Worldwide Gross Income\n")
ax.set_xlabel("Duration (minutes)")
ax.set_ylabel("Gross Income (in billions of USD)")
fig.show()
Well, this doesn't look promising at all. The points aren't really showing any relationship; it's just a big blob. The trend line can't find a meaningful line of best fit through this data. Looking at this, we should assume that there's no meaningful linear relationship between duration and income in our dataset.
Country vs income is gonna be a bit harder to plot, because countries are discrete values, as opposed to continuous values like time. Furthermore, a movie can have multiple countries! The best way to plot this relationship is a bar graph, but first we need a better way to represent the countries associated with each movie. One method is to make a dataframe column for each country: each movie associated with a given country gets a 1 in that country's column, and a 0 otherwise. (This approach will be especially helpful later, when we are generating regressions.) But we can't make a column for every country; that would be excessive. Let's find which countries have the most movies associated with them:
#we make a list containing every instance of a country name from the country column of our dataframe
#creating a list of lists of country names by splitting on commas
con_list = list(df['country'].dropna().str.split(','))
#flattening our list of lists
con_list = [item for sublist in con_list for item in sublist]
#removing whitespace in case the names have irregular whitespace
con_list = [x.strip() for x in con_list]
#we create a dictionary giving a frequency count for each country name, then turn that dictionary into a dataframe so that we can sort by frequency
freq = pd.DataFrame.from_dict(collections.Counter(con_list), orient='index', columns=['frequency'])
freq = freq.sort_values(by=['frequency'], ascending=False)
freq.head(10)
| | frequency |
|---|---|
| USA | 11339 |
| France | 4307 |
| UK | 3131 |
| Germany | 2252 |
| India | 1936 |
| Japan | 1610 |
| Italy | 1533 |
| Canada | 1380 |
| Spain | 1340 |
| South Korea | 976 |
It looks like the top 5 countries are USA, France, UK, Germany, and India, so those are the countries that we are going to use. Now, let's make country columns in our dataframe:
#removing all spaces so the comma-separated country names are uniform
#(safe here, since none of our top 5 country names contain spaces)
df['country'] = df['country'].str.replace(' ', '')
#the get_dummies function allows us to easily make our country columns
df_con = df['country'].str.get_dummies(sep=',')
#we only need the countries from the top 5
df_con = df_con[['USA', 'France', 'UK', 'Germany', 'India']]
#joining our country columns to the main df
df = df.join(df_con)
#we won't need the original country column going forward
df = df.drop(['country'], axis=1)
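If that get_dummies call seems opaque, here's what it does on a toy Series (a throwaway example, not part of our pipeline):

#each comma-separated value becomes its own 0/1 column
toy = pd.Series(['USA,France', 'France', 'USA,UK'])
print(toy.str.get_dummies(sep=','))
#   France  UK  USA
#0       1   0    1
#1       1   0    0
#2       0   1    1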
Let's see what our dataframe looks like now, with the new country columns:
df.head()
| | year | genre | duration | budget | income | adj_income | USA | France | UK | Germany | India |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1916 | Drama, Fantasy, Horror | 63 | NaN | 144968.0 | 3.442139e+06 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1920 | Fantasy, Horror, Mystery | 76 | 18000.0 | 8811.0 | 1.140192e+05 | 0 | 0 | 0 | 1 | 0 |
| 2 | 1921 | Drama | 107 | NaN | 4272.0 | 6.176763e+04 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1920 | Comedy, Drama, Romance | 75 | NaN | 772155.0 | 9.992110e+06 | 1 | 0 | 0 | 0 | 0 |
| 4 | 1921 | Drama, Romance, War | 150 | 800000.0 | 9183673.0 | 1.327841e+08 | 1 | 0 | 0 | 0 | 0 |
Now that our country data is better organized, let's plot the relationship between income and country using a bar graph:
#our x values will be our list of countries
x = ['USA', 'France', 'UK', 'Germany', 'India']
#our y values will be the mean income for each country
y = []
for i in x:
    df_ = df.loc[df[i] == 1]
    y.append(df_['income'].mean())
#sorting our x and y values by the y values, so that our bars will be in order from shortest to tallest
tuples = zip(*sorted(zip(y, x)))
y, x = [list(i) for i in tuples]
fig, ax = plt.subplots()
ax.bar(x, y)
ax.set_title("Average Worldwide Gross Income of Movies by Country\n")
ax.set_ylabel("Average Gross Income\n(in tens of millions of USD)")
fig.show()
So mean income varies by country, with USA being the country (of the countries we're looking at) with the highest mean income. Given that the United States has Hollywood, that's not surprising.
Plotting genre vs income is going to be very similar to plotting country vs income. Since there are fewer genres than countries, let's try to use all of the genres:
#generating the genre columns
df['genre'] = df['genre'].str.replace(' ', '')
df_genres = df['genre'].str.get_dummies(sep=',')
df = df.join(df_genres)
df = df.drop(['genre'], axis=1)
#storing the list of genres:
genres = list(df_genres.columns)
#getting our x and y values, just like we did for countries
x = genres
y = []
for i in x:
    df_ = df.loc[df[i] == 1]
    y.append(df_['income'].mean())
tuples = zip(*sorted(zip(y, x)))
y, x = [list(i) for i in tuples]
#plotting
fig, ax = plt.subplots(figsize=(10, 10))
#horizontal bar graphs are better if your graph has a lot of bars
ax.barh(x, y)
ax.set_title("Average Worldwide Gross Income of Movies by Genre")
ax.set_xlabel("Average Gross Income (in hundreds of millions of USD)")
fig.show()
We can see that mean income varies based on genre. Interestingly, a big chunk of the genres have similar mean incomes.
Now that we've visualized the data, let's generate a statistical model of the factors influencing worldwide gross income. For this analysis, we are going to generate linear regressions.
A linear regression is similar to a linear equation (y = mx + b). A linear equation takes an input value, multiplies it by some scalar, and adds a constant to produce an output value. A linear regression takes n input values (one for each of the regression's n input variables), multiplies each by its own scalar, adds them all together along with a constant, and outputs the sum. That sum is the prediction of what the output should be based on the input values, give or take some random error. A linear regression with n different independent variables can be written as the following equation:
Y = b0 + (b1 * x1) + (b2 * x2) + ... + (bn * xn) + random error
We generate linear regressions much like how we generate lines of best fit. Given a bunch of points on a grid, we can draw a line through them that represents the line of best fit. Now imagine you have a bunch of n-dimensional points: these points are the rows in your dataframe, with n - 1 columns used as input variables and 1 column used as the output variable. Given these points, you can draw (using an algorithm in Python) a "line" of best fit. That "line" is your regression. (The trend lines we plotted earlier on top of our scatter plots were linear regressions for a bunch of 2-dimensional points.) Now, if you have a new set of input values, you can calculate where the output value should be based on your regression. Linear regressions have a random error term because our models probably aren't taking every possible factor into account (and we don't know what factors we're missing). To check whether our linear regression is accurate, we only use some of our dataset (the training set) to generate the regression, then test the regression on the rest of our data (the test set) to see how accurately it can predict that data's output values from its input values.
In order to test the effectiveness of our regressions, we will calculate an R-squared value for the training set and for the test set. An R-squared value is a measure of how accurately a regression predicts the actual output values of the data: it tells us how close our line of best fit gets to the actual points, with a value of 1.0 meaning the regression predicts the output values perfectly. The R-squared value for the training set will tell us how well our regression can predict the data used to generate it, and the R-squared value for the test set will tell us how well our regression can predict new data. If the training R-squared value is significantly larger than the test R-squared value, the regression is much better at predicting existing data than new data. This is called overfitting, and it's a problem since the purpose of a regression is to predict new output values from new input values.
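For reference, the R-squared value that sklearn's score() function reports can be computed directly from the definition; here's a minimal sketch (y_true and y_pred are placeholder names for arrays of actual and predicted output values):

#R-squared = 1 - (sum of squared residuals) / (total sum of squares)
def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot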
Before we can start generating linear regressions, there is a little bit of clean up we need to do. Since the linear regression generation function won't work on data containing NaN values, we need to remove any rows with NaN values in columns that we want to use in our regressions. Let's see what columns have NaN values:
print("Dataframe number of rows: " + str(df.shape[0]))
print("\nNumber of NaN values per column:")
print(df.isna().sum())
Dataframe number of rows: 30955

Number of NaN values per column:
year            0
duration        0
budget      21930
income          0
adj_income      0
USA             0
France          0
UK              0
Germany         0
India           0
Action          0
Adventure       0
Animation       0
Biography       0
Comedy          0
Crime           0
Documentary     0
Drama           0
Family          0
Fantasy         0
Film-Noir       0
History         0
Horror          0
Music           0
Musical         0
Mystery         0
Romance         0
Sci-Fi          0
Sport           0
Thriller        0
War             0
Western         0
dtype: int64
From this we can see that a massive proportion of our data has NaN for the budget column. If we want to generate regressions based on budget, we're going to have to use a much smaller dataset. Budget looked like a promising factor in our initial analysis, however, so we should still pursue it. Let's make a second dataframe containing only rows with usable budget values:
df_bug = df.dropna(subset=['budget'])
df_bug = df_bug.reset_index().drop(['index'], axis=1)
Now we're ready to start generating regressions.
Before we make more complicated regressions, let's make a simple 2D regression for budget vs income using our df_bug dataframe. To do this, we are going to use the sklearn module. We've already made this regression with Seaborn, but sklearn will let us make more complicated regressions later on, and test those regressions more easily.
#first let's get our X variables (input variables) and Y variable (output variable)
#if your X contains only one variable, you need to convert it to an np.array and reshape it using np.array.reshape(-1, 1)
X = np.array(df_bug['budget']).reshape(-1, 1)
Y = df_bug['income']
#next let's divide our X and Y data into a testing set and a training set
#we'll have 25% of our data in the test set, which is the standard proportion
#having a specific seed for randomization (random_state) means our results are replicable
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size = 0.25, random_state = 35)
#now let's make a regression object and train it on our training set
reg = LinearRegression()
reg.fit(Xtrain, Ytrain)
#let's take a look at what our equation looks like
print("Linear Regression Coefficients:\n" + str(reg.coef_) + "\n")
print("Linear Regression Intercept:\n" + str(reg.intercept_) + "\n")
print("Linear Regression Equation (x is budget, y is income):\ny = " + str(reg.coef_[0]) + ' * x + ' + str(reg.intercept_) + "\n")
#now let's get our r-squared values using the score() function
acc_train = reg.score(Xtrain, Ytrain)
acc_test = reg.score(Xtest, Ytest)
#let's see what our accuracies look like
print("Linear Regression Accuracy (R-squared) for Training Data: " + str(acc_train))
print("Linear Regression Accuracy (R-squared) for Test Data: " + str(acc_test))
print("\n")
#and let's plot our regression
fig, ax = plt.subplots()
ax.scatter(df_bug['budget'], df_bug['income'])
ax.plot(X, reg.predict(X), color='red')
ax.set_title("Linear Regression for Budget vs Income")
ax.set_xlabel("Budget")
ax.set_ylabel("Income")
fig.show()
Linear Regression Coefficients:
[3.27406362]

Linear Regression Intercept:
-15229062.394591153

Linear Regression Equation (x is budget, y is income):
y = 3.2740636202127504 * x + -15229062.394591153

Linear Regression Accuracy (R-squared) for Training Data: 0.5666269492892935
Linear Regression Accuracy (R-squared) for Test Data: 0.5591197402272339
So our regression has an R-squared of a bit more than 0.5, which means budget is a fairly good predictor of income. There is also very little difference between our training and test accuracies, so the model should generalize well to new data. Keep in mind, though, that this model is based on the smaller budget-only dataset.
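To actually use the model on new data, we can feed a new budget value into reg.predict(); for example, for a hypothetical $50 million budget (a made-up value, just for illustration):

#predict() expects a 2D array, one row per movie
print(reg.predict(np.array([[50000000]])))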
In addition to calculating R-squared values, we should analyze our residuals. A residual is the difference in the Y-value between a data point and the regression line. To represent our residuals for a given regression, we can plot each of our data points with its y-value changed to be that point's residual. If our model isn't missing some significant trend in the data, our residuals shouldn't follow any trend either; the line of best fit through our residuals should be y = 0. Let's plot the residuals for the regression we just made:
#generating our residuals
res = Y - reg.predict(X)
#plotting them
fig, ax = plt.subplots()
ax.scatter(X, res)
#plotting a trend line for our residuals
res_trend = LinearRegression()
res_trend.fit(X, res)
ax.plot(X, res_trend.predict(X), color='red')
trend = "y = " + str(res_trend.coef_[0]) + " * x + " + str(res_trend.intercept_)
ax.set_title("Residuals for Linear Regression for Budget vs Income\n")
ax.set_xlabel("Budget\n\nTrend line: " + trend)
ax.set_ylabel("Residuals")
fig.show()
Our residuals follow a trend line that's relatively close to y = 0, so they have a mean close to zero. Furthermore, they're mostly randomly scattered. However, the residuals become densely clustered around the mean near the left side of the graph, which implies that our regression model is missing some factor. Let's make some regressions with more x-variables!
In order to generate regressions based on complex sets of factors, let's write a general function for making and viewing regressions:
def gen_LinReg(df, xcols, ycol, formula=False, r2=False):
if not (formula or r2):
return
print("Linear Regression")
print("X = " + str(xcols))
print("Y = " + str(ycol) + "\n")
#getting X and Y
X = df[xcols]
#reshaping if X is one column
if len(xcols) == 1:
X = np.array(X).reshape(-1, 1)
Y = df[ycol]
#splitting data into training and test sets
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size = 0.2, random_state = 35)
#generating regression
reg = LinearRegression()
reg.fit(Xtrain, Ytrain)
#viewing regression
if formula:
print("Coefficients:\n" + str(reg.coef_))
print("\nIntercept:\n" + str(reg.intercept_) + "\n")
#viewing r2
if r2:
#getting r2
acc_train = reg.score(Xtrain, Ytrain)
acc_test = reg.score(Xtest, Ytest)
print("Accuracy (R-squared) for Training Data: " + str(acc_train))
print("Accuracy (R-squared) for Test Data: " + str(acc_test))
Now let's look at the regression for year, budget, and duration vs income:
gen_LinReg(df_bug, ['year', 'duration', 'budget'], 'income', formula=True, r2=True)
Linear Regression
X = ['year', 'duration', 'budget']
Y = income

Coefficients:
[-3.09183340e+04  3.39113246e+05  3.20746419e+00]

Intercept:
12189434.839708388

Accuracy (R-squared) for Training Data: 0.5659868927316661
Accuracy (R-squared) for Test Data: 0.5712998701736798
According to our R-squared values, this isn't really any more accurate than our regression based only on budget. So how effective are the models based only on year, and only on duration? Let's see:
gen_LinReg(df_bug, ['year'], 'income', r2=True)
Linear Regression
X = ['year']
Y = income

Accuracy (R-squared) for Training Data: 0.015503387320112139
Accuracy (R-squared) for Test Data: 0.010664703761729788
gen_LinReg(df_bug, ['duration'], 'income', r2=True)
Linear Regression
X = ['duration']
Y = income

Accuracy (R-squared) for Training Data: 0.06321684125890437
Accuracy (R-squared) for Test Data: 0.06194652490740049
Those are both very ineffective models based on their R-squared values. Let's try a regression based on countries:
gen_LinReg(df, ['USA', 'France', 'UK', 'Germany', 'India'], 'income', r2=True)
Linear Regression
X = ['USA', 'France', 'UK', 'Germany', 'India']
Y = income

Accuracy (R-squared) for Training Data: 0.06759725355273039
Accuracy (R-squared) for Test Data: 0.06528237344772114
That's still not very effective. Next, let's try a regression based on genres:

gen_LinReg(df, genres, 'income', r2=True)
Linear Regression
X = ['Action', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Family', 'Fantasy', 'Film-Noir', 'History', 'Horror', 'Music', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Sport', 'Thriller', 'War', 'Western']
Y = income

Accuracy (R-squared) for Training Data: 0.11696974116788017
Accuracy (R-squared) for Test Data: 0.09389917778136592
Another useless model. It looks like budget was by far the best factor for determining income. For fun, let's try a regression based on all of our factors:
gen_LinReg(df_bug, ['year', 'duration', 'budget', 'USA', 'France', 'UK', 'Germany', 'India'] + genres, 'income', r2=True)
Linear Regression
X = ['year', 'duration', 'budget', 'USA', 'France', 'UK', 'Germany', 'India', 'Action', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Family', 'Fantasy', 'Film-Noir', 'History', 'Horror', 'Music', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Sport', 'Thriller', 'War', 'Western']
Y = income

Accuracy (R-squared) for Training Data: 0.5781010972286499
Accuracy (R-squared) for Test Data: 0.5856436989159083
This final regression performs very slightly better than the budget-only model, but certainly not enough of an improvement to justify the amount of data it needs to make predictions. Having this many factors can also lead to overfitting, so it's usually better to stick to simpler models.
In this project, we've learned how to:
- load and tidy a real-world dataset with Pandas
- explore relationships between variables by plotting them with Matplotlib and Seaborn
- build, test, and interpret linear regression models with sklearn

Our analysis of the IMDb dataset has shown us that, of the factors we looked at, only movie budget can be used to effectively predict the worldwide gross income of a movie.
Today, we've learned about linear regressions, but there are many other kinds of regression. One of these is logistic regression, which can be used to predict which of two binary states something is in. For example, if we wanted to predict whether or not a movie is a comedy, we could use a logistic regression based on factors such as year, duration, and/or the other genres. Speaking of factors, you may be able to make a model more effective by using interaction features, a kind of factor that combines multiple factors. For example, imagine that Western movies were really popular up until the 1960s, and that movies in general made more and more money as time went on. The worldwide gross income of a movie might then depend on the year (a higher year means more money) and also on the combination of year and Western (if the year is low and the movie is a Western, the movie will make more money). That combination of year and Western is an interaction feature. If you are interested in either logistic regressions or interaction features, try using them to analyze other parts of the IMDb dataset! Or if you're tired of movies, there are tons of datasets available for free on Kaggle!
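As a starting point, here's a minimal sketch of what that comedy classifier might look like using sklearn's LogisticRegression (it reuses the df_bug dataframe and genre columns from above, and is an illustration under those assumptions, not a tuned model):

from sklearn.linear_model import LogisticRegression

#using year, duration, and budget as inputs, and the Comedy genre column as the binary output
X = df_bug[['year', 'duration', 'budget']]
Y = df_bug['Comedy']
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.25, random_state=35)
clf = LogisticRegression(max_iter=1000)
clf.fit(Xtrain, Ytrain)
#for classifiers, score() reports accuracy (the fraction of correct predictions)
print("Accuracy for Test Data: " + str(clf.score(Xtest, Ytest)))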